Abstract

Learning a kernel matrix from human feedback in the form of
relative comparisons is an important problem with applications
in collaborative filtering, object retrieval, and search. For
learning a kernel over a large number of objects, existing
methods face significant scalability issues that inhibit their
application to settings where a kernel must be learned in an
online and timely fashion. In this paper we propose a novel
framework called Efficient online Relative comparison Kernel
LEarning (ERKLE) for efficiently learning the similarity of a
large set of objects in an online manner. We learn a kernel
from relative comparisons via stochastic gradient descent, one
query response at a time, taking advantage of the sparse and
low-rank properties of the gradient to efficiently restrict the
kernel to the space of positive semidefinite matrices. In
addition, we derive a passive-aggressive online update that
minimally satisfies new relative comparisons so as not to
disrupt the influence of previously obtained comparisons.
Experimentally, we demonstrate a considerable improvement in
speed while obtaining improved or comparable accuracy compared
to current methods in the online learning setting.
1 Introduction

Learning a similarity model over a set of objects from human
feedback is important to many applications in collaborative
filtering, document and multimedia retrieval, and
visualization. It has been shown that incorporating human
feedback can greatly improve the overall performance of such
applications [15, 30]. In this work we focus on learning a
similarity model from human feedback through relative
comparisons. More specifically, we focus on the relative
comparison kernel learning (RCKL) problem, in which the goal is
to learn a positive semidefinite (PSD) kernel matrix from
relative comparisons given by humans. Kernels are used to model
object relationships in many learning techniques [23], so a
learned kernel can be used directly by the many kernel-based
methods for these applications.

In learning a kernel from human supervision, it is important
to obtain feedback that is intuitive for the user to provide
and informative for a learning algorithm to use. For instance,
naive forms of supervision such as numerical judgments between
pairs of objects have been shown to be very noisy. A relative
comparison, the response to a query of the form “Is object A
more similar to object B or C?”, is well known as an intuitive
mechanism for soliciting human feedback and an effective way of
learning similarity. Recent works addressing fine-grained
categorization and perceptual visualization design have shown
the practicality and benefit of learning kernels from relative
comparisons.

Many RCKL methods [1, 26] learn a kernel by solving a
semidefinite program (SDP) in batch, where all obtained
relative comparisons are required to learn the kernel.
However, in many practical applications a batch approach is
not appropriate due to the online and dynamic nature of the
application. For example, in crowdsourcing it is often of
interest to minimize the number of dispatched tasks, and thus
the cost of the crowd, by leveraging active learning
techniques [25, 9] to adaptively select the most informative
relative comparison query. The success of these techniques
depends on maintaining an up-to-date model, so that the most
informative query is selected, and on an efficient learning
method that updates the model quickly enough that no crowd
participant is idle. Likewise, recommendation systems for
online marketplaces obtain continuous feedback in the form of
click-through data via user interaction. For the learned
kernel to be up to date and reflect the latest user feedback,
the learning method must be able to quickly incorporate
feedback as it is received.

These scenarios motivate the need for an efficient and online
method for learning from large-scale relative comparison data.
Batch methods scale poorly to large object collections
primarily because they must ensure their solutions are PSD.
Without any prior assumptions on the data, this operation has
$O(n^3)$ time complexity for n objects, which for large n is
prohibitively slow for the aforementioned applications.

This work introduces a novel online RCKL framework called
Efficient online Relative comparison Kernel LEarning (ERKLE)
that sequentially updates a kernel one query response at a
time in $O(n^2)$ complexity. ERKLE employs stochastic gradient
descent for RCKL, taking advantage of the sparse and low-rank
structure of the RCKL gradient over a single comparison to
devise fast updates that only require finding the smallest
eigenvalue, and the corresponding eigenvector, of a suitable
matrix. We show that the gradient structure, which enables
such an efficient update, generalizes several well-known
convex RCKL methods [1, 26]. The structure of the gradient
also reveals a simple way to bound the smallest eigenvalue
after each gradient step, which often allows updates to be
performed in constant time. Motivated by work in online
learning, we also derive a passive-aggressive version of ERKLE
to ensure learned kernels model the most recently obtained
relative comparisons without over-fitting.
In summary, our main contributions are: 

1. An online RCKL framework for large-scale similarity 
learning that generalizes many current RCKL methods. 

2. An efficient kernel update method with $O(n^2)$ time
complexity that exploits the unique structure of RCKL
stochastic gradients when stochastic gradient steps may
result in a non-PSD matrix.

3. A passive-aggressive update procedure for online relative
comparison kernel learning.

4. An experimental evaluation that shows ERKLE has both 
improved performance and faster run times compared to 
batch RCKL methods. 

2 Related Work 

The problem of learning a kernel matrix, driven by relative
comparison feedback, has been the focus of much recent work.
Most recent techniques differ primarily in the choice of loss
function. For instance, Generalized Non-metric
Multidimensional Scaling [1] employs hinge loss, Crowd Kernel
Learning [25] uses a scale-invariant loss, and Stochastic
Triplet Embedding [26] uses a logistic loss function.

The aforementioned RCKL methods can be viewed as solving a
kernelized special case of the classic non-metric
multidimensional scaling problem, where the goal is to find an
embedding of objects in $\mathbb{R}^d$ such that they satisfy
given Euclidean distance constraints. In contrast to many of
the kernel-learning formulations, their analogous
embedding-learning counterparts are non-convex optimization
problems, which only guarantee convergence to a local minimum.
In the typical non-convex batch setting, multiple solutions
are found with different initializations and the best is
chosen among them. This strategy is poorly suited to the
online setting, where triplets are observed sequentially and
which solution is best may change as feedback is received.

In this work we consider the online RCKL problem, where one
is sequentially acquiring relative comparisons among a large
collection of objects. Stochastic gradient descent techniques
are a popular class of methods for online learning of
high-dimensional data for a very general class of functions,
and recent techniques have demonstrated competitive
performance with batch techniques. In particular, recent
methods [18, 16] have developed efficient ways to solve SDPs
in an online fashion. The work of [4] shows how to devise
efficient update schemes for solving SDPs when the gradient of
the objective function is low-rank. We build upon and improve
the efficiency of this work by taking advantage of the sparse
and low-rank structure of the gradient common in convex RCKL
formulations.

Our passive-aggressive step size procedure is similar to the
one introduced in earlier work on passive-aggressive online
learning for other online learning problems. In that work, the
authors create a passive-aggressive online update rule for
classic SVM formulations used in problems such as binary and
multi-class classification and regression. In deriving such
an update for different RCKL loss functions, we show how
different methods can be utilized under a common
passive-aggressive framework. To our knowledge, such an update
for RCKL problems and the associated analysis of RCKL methods
have not been presented before.

3 Preliminaries 

In this section, we formally define RCKL and provide a brief
overview of RCKL methods. Let $\mathbb{S}^n_+$ be the set of
$n \times n$ PSD matrices, and let $M^{ab}$ be the entry at
row $a$, column $b$ of a matrix $M$. The goal of RCKL is to
learn a PSD kernel matrix $K \in \mathbb{S}^n_+$ over n
objects, given a set $T$ of triplets:

(3.1) $T = \{(a, b, c) \mid a \text{ is more similar to } b \text{ than } c\}$

such that squared distance constraints are satisfied:

(3.2) $\forall (a, b, c) \in T : \; d_K^2(a, b) < d_K^2(a, c)$

where $d_K^2(a, b) = K^{aa} + K^{bb} - 2K^{ab}$.

We say a kernel $K$ satisfies a triplet $t_i = (a_i, b_i, c_i) \in T$
if the constraint in (3.2) corresponding to $t_i$ is satisfied.
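
As a concrete illustration of these definitions, the following minimal Python sketch computes the squared kernel distance and checks whether a kernel satisfies a triplet. The function names `squared_distance` and `satisfies` and the toy kernel are assumptions introduced here for illustration; they are not taken from the original implementation.

```python
def squared_distance(K, a, b):
    """Squared kernel distance d_K^2(a, b) = K[a][a] + K[b][b] - 2 K[a][b]."""
    return K[a][a] + K[b][b] - 2.0 * K[a][b]


def satisfies(K, triplet):
    """Return True if, under K, object a is strictly closer to b than to c."""
    a, b, c = triplet
    return squared_distance(K, a, b) < squared_distance(K, a, c)


# Toy kernel (hypothetical): the Gram matrix of the 1-D points 0, 1, 3.
K = [[0.0, 0.0, 0.0],
     [0.0, 1.0, 3.0],
     [0.0, 3.0, 9.0]]

closer_to_1 = satisfies(K, (0, 1, 2))   # object 0 vs. objects 1 and 2
closer_to_2 = satisfies(K, (0, 2, 1))   # the opposite answer
```

Because the constraint in (3.2) is strict, a kernel in which all objects are equidistant (such as the identity) satisfies no triplet.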

In this work, we consider triplets that are answers to
relative comparison queries posed to one or more people. We
define a query $q$ to have three components: a “head” object
$h$ to be compared with two objects $o^1$ and $o^2$. A query
$q = (h, \{o^1, o^2\})$ can be answered by either the triplet
$(h, o^1, o^2)$ or $(h, o^2, o^1)$, indicating that $h$ is more
similar to $o^1$ than $o^2$ or that $h$ is more similar to
$o^2$ than $o^1$, respectively. It is desirable to learn a
kernel that not only satisfies observed triplets but also
generalizes to unseen triplets, leading to a learned kernel
that models a more complete notion of the desired human
similarity space.

3.1 RCKL Formulation Many RCKL methods can be generalized by
the following SDP:

(3.3) $\min_{K} \; L(K, T) + \tau \, \mathrm{Trace}(K) \quad \text{s.t.} \quad K \succeq 0.$

The objective function is composed of two terms. The first
term is a function $L$ measuring how much loss $K$ incurs for
not satisfying triplets in $T$. The second term is a trace
regularization on $K$ weighted by a hyperparameter $\tau$.
Trace regularization is used as a convex approximation of the
non-convex rank function. Higher values of $\tau$ make (3.3)
produce lower-complexity similarity models. Finally, $K$ is
constrained to be PSD.

The loss function in the objective can be decomposed into the
sum of losses over individual triplets:

(3.4) $L(K, T) = \sum_{t \in T} \ell(K, t).$

Existing RCKL methods differ in the choice of the loss
function $\ell$. The Stochastic Triplet Embedding (STE)
approach of [26] defines $\ell(K, t) = -\log p_t^K$ as the loss
function, where $p_t^K$ is the probability that a triplet is
satisfied:

(3.5) $p^K_{t=(a,b,c)} = \dfrac{\exp(-d_K^2(a, b))}{\exp(-d_K^2(a, b)) + \exp(-d_K^2(a, c))}$

Generalized Non-metric Multidimensional Scaling (GNMDS) [1]
uses a hinge loss, where $\ell(K, t = (a, b, c))$ is defined
as:

(3.6) $\max(0, \; d_K^2(a, b) - d_K^2(a, c) + 1).$

For either loss function $\ell$, (3.3) is a convex
optimization problem and the globally optimal solution is
found by performing projected gradient descent, which consists
of two update steps. The first step is a simple descent step
along the negative gradient of the objective:

(3.7) $K' = K_{i-1} - \delta_i \left( \nabla L(K_{i-1}, T) + \tau I \right),$

where $i$ denotes the current iteration and $\delta_i$ is the
learning rate. The second step projects the result of the
first step onto the PSD cone:

(3.8) $K_i = \Pi_{\mathbb{S}^n_+}(K').$

These steps are iterated until convergence.
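
To make the two loss functions concrete, here is a minimal Python sketch of the per-triplet STE and GNMDS losses and of the regularized objective. All function names (`sq_dist`, `ste_probability`, `ste_loss`, `gnmds_loss`, `objective`) are hypothetical names chosen for this illustration, and kernels are represented as plain nested lists.

```python
import math


def sq_dist(K, a, b):
    """Squared kernel distance between objects a and b."""
    return K[a][a] + K[b][b] - 2.0 * K[a][b]


def ste_probability(K, t):
    """Probability that triplet t = (a, b, c) is satisfied under K."""
    a, b, c = t
    z = sq_dist(K, a, b) - sq_dist(K, a, c)
    # The softmax ratio simplifies to 1 / (1 + exp(z)); branch on the
    # sign of z so that exp never receives a large positive argument.
    if z >= 0:
        return math.exp(-z) / (1.0 + math.exp(-z))
    return 1.0 / (1.0 + math.exp(z))


def ste_loss(K, t):
    """STE loss -log p, computed as log(1 + exp(z)) for numerical stability."""
    a, b, c = t
    z = sq_dist(K, a, b) - sq_dist(K, a, c)
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))


def gnmds_loss(K, t):
    """GNMDS hinge loss with a margin of 1."""
    a, b, c = t
    return max(0.0, sq_dist(K, a, b) - sq_dist(K, a, c) + 1.0)


def objective(K, triplets, loss, tau):
    """Sum of per-triplet losses plus tau times the trace of K."""
    trace = sum(K[i][i] for i in range(len(K)))
    return sum(loss(K, t) for t in triplets) + tau * trace


identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
```

Under the identity kernel every pair of distinct objects is equidistant, so the STE probability is 0.5 and the hinge loss equals the margin.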

4 Efficient Online Relative Comparison Kernel 
Learning (ERKLE) 

The main computational bottleneck of traditional RCKL methods
is the projection onto the PSD cone, $\Pi_{\mathbb{S}^n_+}$.
This projection is commonly found by first taking the
eigendecomposition $K' = V \Lambda V^T$ and setting all
negative eigenvalues to 0, i.e. $K_i = V [\Lambda]_+ V^T$,
where $[\cdot]_+$ is defined entry-wise as
$[\Lambda^{aa}]_+ = \max(0, \Lambda^{aa})$. Absent any prior
knowledge of the structure of $K'$, its full
eigendecomposition is necessary for the projection. Since
this is an $O(n^3)$ operation, the projection step renders
batch methods computationally prohibitive for learning the
similarity of a large number of objects in an online manner.


4.1 Stochastic Gradient Step To create an efficient and online
framework for RCKL, ERKLE, we leverage stochastic gradient
descent techniques. As shown in (3.4), the loss function $L$
naturally decomposes into a sum of losses $\ell$ defined on
individual observations (triplets in our case). From this
decomposition, ERKLE first performs the following stochastic
gradient step:

(4.9) $K' = K_{j-1} - \delta_j \nabla \ell(K_{j-1}, t_j),$

where triplets $t_1, \ldots, t_j$ have been observed and
$K_{j-1}$ is the online solution after observing the
$(j-1)$-th triplet.

Performing a stochastic optimization gives ERKLE an advantage
over current RCKL methods that perform batch optimizations.
Batch methods attempt to minimize a loss function over a
training set. This is known to minimize empirical risk with
respect to the particular training samples, which is used as
an estimate of expected risk over the ground truth
distribution over all samples. Obtaining triplets in an
online fashion from a source can be viewed as sampling
triplets from a ground truth distribution at random. As such,
taking stochastic steps over samples directly minimizes
expected risk with respect to the ground truth distribution of
triplets, not empirical risk with respect to the training
instances. The practical impact of this characteristic is that
stochastic methods tend to generalize better to unobserved
samples. This characteristic of stochastic methods is
discussed further in the stochastic optimization literature.

Note that our online formulation does not include trace
regularization. Although this may impact our method's ability
to generalize to unseen triplets, our online formulation
achieves good generalization through carefully constructed,
data-dependent step sizes $\delta_j$, as detailed in
Section 4.3.

4.2 Efficient Projection In order to retain positive
semidefiniteness, after taking a stochastic gradient step the
resulting matrix $K'$ must be projected onto the PSD cone.
Following the full eigendecomposition procedure for
$\Pi_{\mathbb{S}^n_+}$ is prohibitively expensive for our
online setting. Instead, for RCKL methods we can take
advantage of the sparse and low-rank nature of the gradient to
devise an efficient projection scheme. To this end, we
introduce a canonical gradient matrix $G$ over a triplet
$t = (a, b, c)$, where the entries are defined as:

(4.10) $G^{ij} = \begin{cases}
  -2 & \text{if } i = a, j = b \text{ or } i = b, j = a \\
  2 & \text{if } i = a, j = c \text{ or } i = c, j = a \\
  1 & \text{if } i = b, j = b \\
  -1 & \text{if } i = c, j = c \\
  0 & \text{otherwise.}
\end{cases}$

Now consider the following choice for the stochastic step:

(4.11) $\nabla \ell(K, t) = f(K, t) \, G,$

where $f$ is a real-valued function defined below. With (4.11)
as the gradient in (4.9), $K_{j-1}$ is updated by increasing
the entries corresponding to the similarity between objects a
and b and decreasing the similarity between a and c by an
amount proportional to $\delta_j f(K_{j-1}, t_j)$.

The function $f$ can be defined such that we recover the
gradients of $\ell$ for different convex RCKL formulations.
The stochastic gradient for STE can be obtained by defining
$f$ as:

(4.12) $f(K, t) = 1 - p_t^K.$

Similarly, by defining $f$ to be:

(4.13) $f(K, t) = \begin{cases}
  1 & \text{if } d_K^2(a, b) + 1 > d_K^2(a, c) \\
  0 & \text{otherwise}
\end{cases}$

the stochastic gradient for GNMDS is obtained. Note that this
not only generalizes these two methods for use in our online
framework but also suggests a simple way to create new online
RCKL methods by designing a function $f$ that weighs the
contribution of individual triplets.
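
The decomposition of the gradient into a scalar weight times the canonical matrix of (4.10) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names `canonical_gradient`, `f_ste`, `f_gnmds`, and `stochastic_step` are hypothetical, kernels are nested lists, and the triplet indices are assumed to be distinct.

```python
import math


def sq_dist(K, a, b):
    """Squared kernel distance between objects a and b."""
    return K[a][a] + K[b][b] - 2.0 * K[a][b]


def canonical_gradient(n, t):
    """Canonical gradient matrix of (4.10) for t = (a, b, c), a, b, c distinct."""
    a, b, c = t
    G = [[0.0] * n for _ in range(n)]
    G[a][b] = G[b][a] = -2.0
    G[a][c] = G[c][a] = 2.0
    G[b][b] = 1.0
    G[c][c] = -1.0
    return G


def f_ste(K, t):
    """Weight (4.12): one minus the STE probability that t is satisfied."""
    a, b, c = t
    z = sq_dist(K, a, b) - sq_dist(K, a, c)
    # 1 - 1/(1 + exp(z)) equals 1/(1 + exp(-z)); evaluate it stably.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


def f_gnmds(K, t):
    """Weight (4.13): 1 when the unit margin is violated, else 0."""
    a, b, c = t
    return 1.0 if sq_dist(K, a, b) + 1.0 > sq_dist(K, a, c) else 0.0


def stochastic_step(K, t, delta, f):
    """Unprojected step K' = K - delta * f(K, t) * G."""
    n = len(K)
    gamma = delta * f(K, t)
    G = canonical_gradient(n, t)
    return [[K[i][j] - gamma * G[i][j] for j in range(n)] for i in range(n)]


K0 = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
t = (0, 1, 2)
K1 = stochastic_step(K0, t, 0.1, f_gnmds)
gap_before = sq_dist(K0, 0, 1) - sq_dist(K0, 0, 2)
gap_after = sq_dist(K1, 0, 1) - sq_dist(K1, 0, 2)
```

A step of magnitude gamma along the canonical matrix lowers the gap d^2(a, b) - d^2(a, c) by 10 * gamma, which is where the constant 10 in the passive-aggressive step sizes of Section 4.3 comes from.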

Decomposing the online updates in such a way reveals a key
insight into how to perform efficient projections onto the PSD
cone after the stochastic step. Algorithm 1 outlines the
procedure for efficient projection in ERKLE. Here,
$\lambda_{\min}$ and $v_{\min}$ are the smallest eigenvalue of
the matrix $K$ and its corresponding eigenvector,
respectively. This procedure has time complexity $O(n^2)$ due
to finding $\lambda_{\min}$ and $v_{\min}$. To show that
Algorithm 1 does indeed perform a projection onto the PSD
cone, we prove the following theorem:

Theorem 4.1. Algorithm 1 results in a PSD matrix $K_j$ that
is closest to $K'$ in terms of Frobenius distance.

Proof. Let $K_0 \in \mathbb{S}^n_+$ (i.e. the identity). We use
this as our base case and show inductively that after each
iteration of the main loop, $K_j$ remains PSD. Let
$\gamma_j = \delta_j f(K_{j-1}, t_j)$ be the magnitude of an
update. By (4.11), the update in Equation (4.9) can be written
as $K_{j-1} - \gamma_j G$. The only nonzero eigenvalues of
$-\gamma_j G$ are $\lambda_1 = 3\gamma_j$ and
$\lambda_2 = -3\gamma_j$. It follows from Weyl's inequality
that the matrix $K' = K_{j-1} - \gamma_j G$ has at most one
negative eigenvalue. If $K'$ has no negative eigenvalues, then
it is PSD (line 6 of Algorithm 1). If $K'$ has one negative
eigenvalue, line 4 of Algorithm 1 results in a PSD matrix
$K_j$ that is closest to $K'$ in terms of Frobenius distance
by Case 2 of Theorem 4 in [4].
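
The eigenvalue claim used in the proof can be checked directly. Restricting $G$ to the rows and columns indexed by $(a, b, c)$ (every other entry is zero) leaves a $3 \times 3$ block; the worked step below, added here for illustration, expands its characteristic polynomial along the first row.

```latex
\det\begin{pmatrix}
  -\lambda & -2 & 2 \\
  -2 & 1-\lambda & 0 \\
  2 & 0 & -1-\lambda
\end{pmatrix}
= \lambda(1-\lambda^2) + 4(1+\lambda) - 4(1-\lambda)
= 9\lambda - \lambda^3
= -\lambda(\lambda-3)(\lambda+3)
```

Hence $G$ has eigenvalues $3$, $-3$, and $0$, so $-\gamma_j G$ has the nonzero eigenvalues $\pm 3\gamma_j$. Assuming $\gamma_j \ge 0$ and $K_{j-1}$ PSD, Weyl's inequality then bounds the smallest eigenvalue of $K'$ below by $-3\gamma_j$ and the second-smallest below by $0$.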

The important implication of Thm. 4.1 is that ERKLE can
incorporate a triplet into a kernel in $O(n^2)$ time by
performing the efficient projection outlined in Algorithm 1.
Furthermore, if a step is sufficiently small, then no
projection is needed at all. Let $\lambda_{j-1}$ be the
smallest eigenvalue of $K_{j-1}$. By Weyl's inequality, if
$\lambda_{j-1} - 3\gamma_j \ge 0$, then all eigenvalues of $K'$
are greater than or equal to 0. This can be used to skip the
projection step when the update is known to result in a PSD
matrix. In our algorithm, we lower bound the smallest
eigenvalue by maintaining a conservative estimate
$\hat{\lambda}_j$. Initially, $\hat{\lambda}_0 \leftarrow
\lambda_0$, the smallest eigenvalue of $K_0$. It is updated
each iteration with its lower bound $\hat{\lambda}_j \leftarrow
\hat{\lambda}_{j-1} - 3\gamma_j$. If $\hat{\lambda}_j < 0$, then
Alg. 1 is used to project onto the PSD cone and
$\hat{\lambda}_j \leftarrow \max(0, \lambda_{\min})$. Otherwise,
no projection is performed. In the case where
$\hat{\lambda}_0 \gg 3\gamma_j$, this simple lower-bounding
procedure can save many eigenvalue/eigenvector computations
until a projection may be necessary.

Algorithm 1 Efficient PSD Projection
1: procedure $\Pi_{\mathbb{S}^n_+}$(K)
2:   Find $\lambda_{\min}$ and $v_{\min}$ from K
3:   if $\lambda_{\min} < 0$ then
4:     return $K - \lambda_{\min} v_{\min} v_{\min}^T$
5:   else
6:     return K
7:   end if
8: end procedure
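
The projection and the eigenvalue lower bound can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes NumPy is available, the names `project_rank_one` and `BoundedProjector` are hypothetical, and for brevity it calls `np.linalg.eigh`, which computes the full eigendecomposition and therefore does not achieve the quadratic cost of a solver that finds a single eigenpair.

```python
import numpy as np


def canonical_gradient(n, t):
    """Canonical gradient matrix for a triplet t = (a, b, c)."""
    a, b, c = t
    G = np.zeros((n, n))
    G[a, b] = G[b, a] = -2.0
    G[a, c] = G[c, a] = 2.0
    G[b, b] = 1.0
    G[c, c] = -1.0
    return G


def project_rank_one(K):
    """Remove the smallest eigen-direction of K if its eigenvalue is negative.

    Returns the (possibly corrected) matrix and the smallest eigenvalue found.
    """
    eigvals, eigvecs = np.linalg.eigh(K)  # eigenvalues in ascending order
    lam, v = float(eigvals[0]), eigvecs[:, 0]
    if lam < 0:
        return K - lam * np.outer(v, v), lam
    return K, lam


class BoundedProjector:
    """Tracks a conservative lower bound on the smallest eigenvalue."""

    def __init__(self, K0):
        self.K = K0.copy()
        self.lam_hat = float(np.linalg.eigvalsh(K0)[0])
        self.num_projections = 0

    def update(self, G, gamma):
        """Apply K <- K - gamma * G, projecting only when the bound requires it."""
        self.K = self.K - gamma * G
        self.lam_hat -= 3.0 * abs(gamma)  # worst-case eigenvalue decrease
        if self.lam_hat < 0:
            self.K, lam = project_rank_one(self.K)
            self.lam_hat = max(0.0, lam)
            self.num_projections += 1
        return self.K


proj = BoundedProjector(np.eye(3))
G = canonical_gradient(3, (0, 1, 2))
proj.update(G, 0.1)                      # bound stays at 0.7: no eigen-solve
projections_after_first = proj.num_projections
proj.update(G, 0.3)                      # bound drops below 0: one projection
smallest_after = float(np.linalg.eigvalsh(proj.K)[0])
```

In this toy run the first update is applied without any eigenvalue computation, while the second pushes the bound below zero and triggers a single rank-one correction.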

4.3 Passive-Aggressive Updates A key difference between the
batch and stochastic RCKL updates is the magnitude of the
updates. For both methods the magnitude of the update with
respect to a single triplet $t$ is a function of a learning
rate and how well the previous solution satisfies $t$. In the
previous section we denoted the magnitude of an ERKLE update
as $\gamma_j$. In the batch setting, the same learning rate
$\delta_i$ is used for all triplets in a given step. In
contrast, stochastic methods typically use different learning
rates $\delta_j$ for different triplets $t_j$, which can
result in faster convergence. To take advantage of faster
convergence, the learning rates must satisfy certain
conditions. Early work on the topic of learning rates suggests
that $\delta_j$ should satisfy two constraints:
$\sum_{j=1}^{\infty} \delta_j^2 < \infty$ and
$\sum_{j=1}^{\infty} \delta_j = \infty$; for example,
$\delta_j = 1/j$ satisfies these constraints. Later work
suggests a more aggressive setting of $\delta_j = 1/\sqrt{j}$.

However, in the online setting there is no reason to believe
that a triplet should have less influence on the kernel than
those obtained before it. On the other hand, we do not wish to
over-fit to the most recently obtained triplets. It is this
observation that motivates Passive-Aggressive (PA) online
learning. In the RCKL setting, the general idea is that if the
previous solution $K_{j-1}$ satisfies a newly obtained triplet
$t_j = (a, b, c)$ by a margin of 1, then the kernel is not
updated (passive). Otherwise, the kernel is changed by the
minimal amount such that $t_j$ is satisfied by a margin of 1
(aggressive). A fortunate side effect of choosing minimally
sized updates is that they are less likely to result in
non-PSD matrices than larger steps, thus potentially reducing
the number of projections onto the PSD cone via our
conservative eigenvalue estimate (Section 4.2).

To derive a passive-aggressive update for ERKLE, we wish to
learn a magnitude of a stochastic step
$\gamma_j = \delta_j f(K_{j-1}, t_j)$ with passive-aggressive
properties. $f$ as defined by GNMDS in (4.13) is inherently
passive, but if $K_{j-1}$ does not satisfy the margin
constraint, it takes a step independent of how close the
previous solution is to satisfying $t_j$. As such, we wish to
find a $\delta_j$ that takes an aggressive step. We do this by
solving the following optimization problem:

(4.14) $\min_{\delta_j} \; \delta_j^2 \quad \text{s.t.} \quad d_{K'}^2(a, b) + 1 \le d_{K'}^2(a, c), \;\; \delta_j \ge 0$

By (4.11) and (4.13), the first constraint can be rewritten as:

(4.15) $d_{K_{j-1}}^2(a, b) - d_{K_{j-1}}^2(a, c) - 10\delta_j + 1 \le 0$

If the triplet is already satisfied by a margin of one in
$K_{j-1}$, no update is required; otherwise, only a positive
value of $\delta_j$ can satisfy (4.15), making the positivity
constraint on $\delta_j$ redundant. Also, the smallest
$\delta_j$ that satisfies (4.15) is the one that makes the
left-hand side exactly zero. As a result, the inequality
constraint can be handled as an equality. To find the optimum
we first write the Lagrangian $\mathcal{L}(\delta_j, \alpha)$:

(4.16) $\delta_j^2 + \alpha \left( d_{K_{j-1}}^2(a, b) - d_{K_{j-1}}^2(a, c) - 10\delta_j + 1 \right)$

Taking the partial derivative of (4.16) with respect to
$\delta_j$, setting it to 0, and solving for $\delta_j$
results in $\delta_j = 5\alpha$. Substituting this back into
(4.16) makes the Lagrangian:

(4.17) $-25\alpha^2 + \alpha \left( d_{K_{j-1}}^2(a, b) - d_{K_{j-1}}^2(a, c) + 1 \right)$

Taking the partial derivative of (4.17) with respect to
$\alpha$, setting it to 0, solving for $\alpha$, and then
substituting this back into $\delta_j = 5\alpha$ results in the
minimum step size that satisfies the margin constraint:

(4.18) $\delta_j = \dfrac{d_{K_{j-1}}^2(a, b) - d_{K_{j-1}}^2(a, c) + 1}{10}$
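
A small numerical check helps confirm the step size in (4.18). The sketch below is illustrative only: `pa_step_gnmds` and `apply_step` are hypothetical names, the `margin` argument is an added convenience (the derivation above uses a margin of 1), and the update is applied without the subsequent PSD projection.

```python
def sq_dist(K, a, b):
    """Squared kernel distance between objects a and b."""
    return K[a][a] + K[b][b] - 2.0 * K[a][b]


def pa_step_gnmds(K, t, margin=1.0):
    """Passive-aggressive step size; zero when the margin already holds."""
    a, b, c = t
    violation = sq_dist(K, a, b) - sq_dist(K, a, c) + margin
    return max(0.0, violation / 10.0)


def apply_step(K, t, delta):
    """Unprojected update K' = K - delta * G, written out entry by entry."""
    a, b, c = t
    Kp = [row[:] for row in K]
    Kp[a][b] += 2.0 * delta   # G[a][b] = -2
    Kp[b][a] += 2.0 * delta
    Kp[a][c] -= 2.0 * delta   # G[a][c] = 2
    Kp[c][a] -= 2.0 * delta
    Kp[b][b] -= delta         # G[b][b] = 1
    Kp[c][c] += delta         # G[c][c] = -1
    return Kp


K0 = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
t = (0, 1, 2)
delta = pa_step_gnmds(K0, t)      # aggressive: the margin is violated
K1 = apply_step(K0, t, delta)
next_delta = pa_step_gnmds(K1, t)  # passive: the margin now holds exactly
```

After the step the triplet is satisfied with a margin of exactly 1, so presenting the same triplet again produces (up to rounding) no further update.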


A similar passive-aggressive update can be derived using the
probability of a triplet being satisfied in STE. Consider the
following optimization:

(4.19) $\min_{\delta_j} \; \delta_j^2 \quad \text{s.t.} \quad p_{t_j}^{K'} \ge P, \;\; \delta_j \ge 0$

In (4.19) the minimal step size is chosen such that the
probability that a triplet is satisfied after the update is
greater than or equal to a given probability
$P \in (0.5, 1)$. From this, we derive the following step size:

(4.20) $\delta_j = \dfrac{d_{K_{j-1}}^2(a, b) - d_{K_{j-1}}^2(a, c) + k}{10},$

where $k = \log(P) - \log(1 - P)$. The full derivation is
given in a later section. Both derivations reveal that
passive-aggressive updates using STE and GNMDS are very
similar. Setting $P = e/(1+e)$, so that $k = 1$, in (4.20)
recovers the GNMDS passive-aggressive step in (4.18), and
changing the margin in (4.18) recovers different settings of
$P$.
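
The relationship between the target probability and the margin can be checked numerically. The sketch below rests on an explicit assumption: the step is applied directly along the canonical gradient matrix, so that a step of size delta lowers the gap d^2(a, b) - d^2(a, c) by 10 * delta. The names `ste_probability` and `pa_step_ste` are hypothetical.

```python
import math


def ste_probability(gap):
    """STE probability of a satisfied triplet, given gap = d^2(a,b) - d^2(a,c)."""
    if gap >= 0:
        return math.exp(-gap) / (1.0 + math.exp(-gap))
    return 1.0 / (1.0 + math.exp(gap))


def pa_step_ste(gap, P):
    """Smallest step (assumed to act along G) reaching probability P."""
    if not 0.5 < P < 1.0:
        raise ValueError("P must lie strictly between 0.5 and 1")
    k = math.log(P) - math.log(1.0 - P)
    return max(0.0, (gap + k) / 10.0)


gap = 0.0                                   # e.g. the identity kernel
step = pa_step_ste(gap, 0.9)
prob_after = ste_probability(gap - 10.0 * step)

# The target probability whose log-odds equal the GNMDS margin of 1.
P_margin_one = math.e / (1.0 + math.e)
step_margin_one = pa_step_ste(gap, P_margin_one)
```

With a target of 0.9 the updated probability lands on the target, and with P = e/(1+e) the step coincides with the GNMDS passive-aggressive step for a margin of 1.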

Note that using (4.18) as a step size results in a $K'$ with
the intended passive-aggressive property, not necessarily the
kernel $K_j$ after the projection. We choose to find a
passive-aggressive step size instead of a full update for
computational efficiency. Finding a true passive-aggressive
step size with respect to $K_j$ would require iteratively
projecting onto the PSD cone, which is computationally
prohibitive in the online setting. In practice, $K'$ is a good
approximation to $K_j$, as their difference depends on the
magnitude of the (potentially) negative eigenvalue of $K'$,
which tends to be quite small.

Even for a proper setting of $\delta_j$, it has been shown
that stochastic methods perform best when multiple rounds of
updates, or passes, are performed on the observed samples. For
our problem setting, this indicates that ERKLE may benefit
from revisiting triplets that were previously used to update
the kernel. In our experiments we perform a simple multi-pass
scheme where, for each new triplet, ERKLE steps not only over
the most recently obtained triplet but also over a number of
randomly sampled triplets from the set of previously obtained
triplets. We denote the number of “passes” ERKLE performs each
time a new triplet is observed as $\beta$. An algorithm
listing in a later section describes this process in more
detail. This simple approach is sufficient for maintaining
high accuracy while still ensuring computational efficiency
for the online setting.


5 Experiments 

In this section, we evaluate ERKLE by comparing it to batch
RCKL methods. Batch methods are not truly applicable to the
online learning setting, but can be applied in what is often
called “mini-batches”. In the mini-batch learning setting,
every time a new batch of m triplets is received, batch RCKL
is run on all triplets obtained so far. Thus, we compare the
ERKLE variants to their batch counterparts run in
mini-batches.

We evaluate each method on four different data sets, each
with its own challenges. First, we start with a small-scale
synthetic experiment to evaluate how the methods perform in an
idealized setting. Second, a large-scale synthetic experiment
is run to show how ERKLE and batch methods compare in terms of
practical run time. Third, a data set of triplets over popular
music artists is used to evaluate how the methods perform in a
real-world setting with moderate triplet noise. Finally, ERKLE
and batch RCKL are evaluated on a data set of triplets over
scene images, which consists of a small number of triplets,
thus focusing on the performance of these methods with very
little feedback.

For these experiments, we wish to see how the learned 
kernels generalize to held out triplets as triplets are obtained. 





































Figure 1: Results from experiments on the small synthetic data
set (10 trials). (a) Comparison of ERKLE learning rates
$\delta$. (b) Test error vs. number of observed triplets.
(c) Test error vs. number of effective passes.


This is important in real-world applications where the goal 
is to accurately model all the relationships among objects, 
not just the observed ones. Because of this, one of our main 
evaluation metrics is normalized test error, which is defined 
as the total number of unsatisfied test triplets by a learned 
kernel divided by the total number of test triplets. 
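
The metric can be written down in a few lines. The following Python sketch is illustrative; the function name `normalized_test_error` and the toy kernel are assumptions introduced here, not part of the original evaluation code.

```python
def normalized_test_error(K, test_triplets):
    """Fraction of test triplets (a, b, c) that K does not satisfy."""
    if not test_triplets:
        raise ValueError("test_triplets must be non-empty")

    def sq_dist(a, b):
        return K[a][a] + K[b][b] - 2.0 * K[a][b]

    unsatisfied = sum(
        1 for (a, b, c) in test_triplets if not sq_dist(a, b) < sq_dist(a, c)
    )
    return unsatisfied / len(test_triplets)


identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
# Hypothetical kernel: the Gram matrix of the 1-D points 0, 1, 3.
gram = [[0.0, 0.0, 0.0],
        [0.0, 1.0, 3.0],
        [0.0, 3.0, 9.0]]
test_set = [(0, 1, 2), (0, 2, 1), (1, 0, 2)]

error_identity = normalized_test_error(identity, test_set)
error_gram = normalized_test_error(gram, test_set)
```

Because the triplet constraint is strict, an identity kernel leaves every triplet unsatisfied and therefore has a normalized test error of 1.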

Unless otherwise noted, the experiments were run with the
following specifications. Each method started with an initial
kernel set to the identity in order to give no method an
advantage (all methods initially satisfy no triplets). All
batch methods were terminated after a maximum of 1000
iterations or when the change in objective between iterations
fell below a small fixed threshold. We denote the batch
methods with the suffix “-Batch” (e.g. STE-Batch) and the
ERKLE variants with “-ERKLE” (e.g. STE-ERKLE). We denote
passive-aggressive ERKLE as PA-ERKLE, and use the step size
that satisfies the margin by 1 as in (4.18). The mini-batch
size is 100, and all methods are evaluated every 100 observed
triplets.

We used the batch STE, GNMDS, and CKL (Crowd Kernel Learning
[25]) MATLAB implementations provided by [26], in which the
eig MATLAB function is used to find all eigenvalues and
eigenvectors for projection onto the PSD cone. ERKLE was also
implemented in MATLAB, where the eigs function is used to find
the single eigenvalue/eigenvector pair with the smallest
eigenvalue for the projections. The $\tau$ hyperparameter was
chosen to be the best-performing setting over ten varying
options. The timed experiments were performed on an Intel Core
i5-4670K CPU @ 3.4 GHz with 16 GB of RAM and the single-thread
option enabled. Each experiment was performed with ten trials,
each with different, randomly chosen test, train, and
validation sets. The error bars in the graphs represent the
95% confidence interval.

5.1 Small-Scale Synthetic Data Our first experiment tests each
method on an ideal, small-scale, synthetic data set. We
created the synthetic data set by first generating 100 data
points (n = 100) from $N(0, 1)$. Using the distances between
points, we answered all possible relative comparison queries,
which resulted in 485,100 triplets. 10,000 triplets were used
as the train set and the rest were used as the test set.
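
The construction of the synthetic triplets can be sketched as follows. This is a hedged illustration: the function name `answer_all_queries`, the random seed, and in particular the embedding dimension `dim = 2` are assumptions made here, since the dimension used for the data is not recoverable from the text above.

```python
import itertools
import random


def answer_all_queries(points):
    """Answer every query (h, {o1, o2}) using Euclidean distances."""

    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    n = len(points)
    triplets = []
    for h in range(n):
        others = [i for i in range(n) if i != h]
        for o1, o2 in itertools.combinations(others, 2):
            # Put the object closer to the head h in the middle position.
            if sq_dist(points[h], points[o1]) < sq_dist(points[h], points[o2]):
                triplets.append((h, o1, o2))
            else:
                triplets.append((h, o2, o1))
    return triplets


rng = random.Random(0)   # fixed seed, chosen only for reproducibility
dim = 2                  # hypothetical dimension (not stated above)
points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(6)]
triplets = answer_all_queries(points)

# Each of n heads is paired with every unordered pair of the other n - 1 objects.
num_queries_for_100 = 100 * (99 * 98 // 2)
```

The query count for n = 100 objects is 100 * C(99, 2), which matches the 485,100 triplets reported for this data set.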

Discussion: Figure 1a shows the effect that the learning rate
parameter $\delta_j$ has on the performance of ERKLE as more
triplets are observed in an online fashion. For a setting of
$1/j$, the learning rate decays too rapidly to improve
performance significantly after j = 3000. The learning rate
$1/\sqrt{j}$ performs better, but still levels off more
quickly than the final two methods. The last two methods have
learning rates that are independent of the number of observed
triplets. STE-ERKLE with a constant learning rate and PA-ERKLE
take steps based solely on how well the current solution
satisfies the observed triplet, and vastly outperform the
alternative learning rates based on the number of
observations. This result indicates that reducing the
influence of a triplet because it was observed later has an
adverse effect on the ability of a learned kernel to
generalize to unobserved triplets.

Figure 1b shows the performance of STE-ERKLE (with $\delta_j$
set to 1) and PA-ERKLE compared to three batch RCKL methods.
The $\tau$ hyperparameter was chosen by selecting the best
setting over the candidate choices as evaluated on the test
set. With a single pass over the data ($\beta = 1$), both
ERKLE methods slightly outperformed all batch methods. With
ten passes over the data, the ERKLE methods outperformed the
batch methods by a large margin. In addition, the batch
methods level off more quickly than the ERKLE methods,
indicating that if more triplets were obtained, the ERKLE
methods would outperform the batch methods even further. We
believe that these results show that by minimizing the
expected risk directly, ERKLE is able to learn a more general
kernel than batch methods that minimize empirical risk.

Figure 1c shows the performance of two ERKLE methods and two
batch RCKL methods as a function of how many effective
“passes” each method performed on the data. For ERKLE, this
amounts to the setting of the $\beta$ parameter. For the batch
RCKL methods, this is the number of full gradient steps taken.
Each method was run over all training triplets, with the step
size for the batch methods validated on the test set. This
effectively measures training cost as a function of passes
through the data, and thus is independent of implementation.
Clearly, if only a few passes through the data can be
performed, then ERKLE is the better choice.

Figure 2: Results from experiments on the large-scale
synthetic data set (5 trials). (a) Run time (seconds) vs.
number of observed triplets. (b) Test error vs. number of
observed triplets.

5.2 Large-Scale Synthetic Data Next, we evaluated how PA-ERKLE
compared to batch GNMDS in terms of practical run time on a
large-scale experiment. For this experiment, we generated 5000
data points in the same manner in which the small-scale
synthetic data was generated. 10,000 randomly generated
triplets were used as the train set and 50,000 were used as
the test set. The batch methods were run in mini-batches of
500 triplets due to time constraints. The hyperparameter
$\tau$ and the step size $\delta_i$ were chosen as the
settings that performed best on the test set. This experiment
was run over 5 trials, each with a different train and test
set.

Discussion: Figure 2a shows the cumulative run time of one
pass of PA-ERKLE, and of 1 and 2 steps of batch GNMDS. The
times shown for the batch methods are for the best chosen
$\tau$ and do not include the total time it took to find it.
The figure shows that a single pass of PA-ERKLE is often
significantly faster than a single gradient step of batch
GNMDS. Two steps of GNMDS take even longer. ERKLE can perform
online updates much faster due to the efficient projection
procedure as well as the ability to skip certain projections
by estimating the lower bound. In this experiment, the mean
number of eigenvalue/eigenvector computations over the 5
trials was 724.2 with a standard deviation of 3.7. Hence
PA-ERKLE was able to skip the projection step roughly 93% of
the time.

Figure 2b depicts the test errors of each method. Initially,
the batch methods perform better, but at around 2500 triplets
PA-ERKLE begins to outperform the batch methods. This
experiment indicates that PA-ERKLE can achieve results
competitive with batch methods in a single pass over the data,
and produce truly online solutions instead of mini-batch
solutions, while having faster run time.

5.3 Music Artist Similarity For the last two experiments 
we performed evaluations on real-world data sets. First, we 
performed an experiment using relative comparisons among 
popular music artists gathered from a web survey. The 
aset400 data set contains 16,385 relative comparisons 
over 412 artists. We randomly chose 10000 triplets as the 
train set, 1000 as the validation set for the r parameter, and 
the rest were used as the test set. The aset400 data set 
presents a challenge not present in the synthetic data: it has a 
moderate amount of conflicting triplets, thus methods used in 
the evaluation must deal with noise within the triplets. 
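
To make the notion of conflicting triplets concrete, the 
following minimal sketch counts direct contradictions in a 
triplet set. It assumes a triplet (a, b, c) means "a is more 
similar to b than to c", so that (a, b, c) and (a, c, b) 
contradict each other. The function name is hypothetical, and 
indirect (transitive) conflicts are not detected. 

```python
def count_direct_conflicts(triplets):
    """Count pairs of triplets (a, b, c) and (a, c, b) that both occur.

    Assumption: (a, b, c) encodes "a is more similar to b than to c".
    Only direct contradictions are counted, not transitive ones.
    """
    unique = set(triplets)
    mirrored = sum(
        1
        for (a, b, c) in unique
        if b != c and (a, c, b) in unique
    )
    return mirrored // 2          # each conflicting pair is seen twice


# Toy example: the first two triplets contradict each other.
toy_triplets = [(0, 1, 2), (0, 2, 1), (1, 2, 3)]
n_conflicts = count_direct_conflicts(toy_triplets)
```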

Discussion: Figure 3 shows how ERKLE and batch 
RCKL methods generalize to the test set. STE-ERKLE 
performs considerably worse than the other methods, most 
likely due to the noise in the observed triplets. The probability 
$p^{K}_{abc}$ used in STE-ERKLE decays rapidly. Thus, triplets that 
are in agreement with previously obtained triplets do not 
influence the learned kernel greatly. However, a conflicting 
triplet will make STE-ERKLE perform a relatively more 
drastic update. PA-ERKLE, however, is much more robust 
to noise due to the minimal step size taken to satisfy a triplet. 
Because of this, PA-ERKLE performs as well as the batch 
methods and often better when multiple passes are taken. 

Figure 3b shows the training errors of each method. This 
figure highlights how well each method fits the observed 
triplets. The STE-ERKLE models are greatly affected by the 
presence of conflicts in that they do not learn a kernel that fits 
a large number of the observed triplets. PA-ERKLE, on the 
other hand, fits the set of observed triplets better, 
resulting in better test accuracy as well. 
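
The training and test errors discussed here can be read as the 
fraction of triplets that a kernel violates. A minimal sketch 
follows, assuming the usual kernel-induced distance 
d_K(a, b) = K_aa + K_bb - 2 K_ab and counting a triplet (a, b, c) 
as violated when d_K(a, b) >= d_K(a, c). The function names are 
illustrative and not taken from the paper. 

```python
import numpy as np


def kernel_distance(K, i, j):
    """Squared distance induced by kernel K (assumed definition)."""
    return K[i, i] + K[j, j] - 2.0 * K[i, j]


def triplet_error(K, triplets):
    """Fraction of triplets (a, b, c) with d_K(a, b) >= d_K(a, c)."""
    violated = sum(
        1
        for (a, b, c) in triplets
        if kernel_distance(K, a, b) >= kernel_distance(K, a, c)
    )
    return violated / len(triplets)


# Toy kernel from 1-D points 0, 1, 5: object 0 is closer to 1 than to 2.
x = np.array([0.0, 1.0, 5.0])
K_toy = np.outer(x, x)
err = triplet_error(K_toy, [(0, 1, 2), (0, 2, 1)])   # one of two violated
```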

Figure 3: Results from experiments on the aset400 data set (10 trials) 

As previously discussed, unlike batch methods, ERKLE does 
not use trace regularization. Experimentally, however, we 
nevertheless find that our method outperforms batch methods 
that use trace regularization, whether they produce low-rank 
or high-rank kernels. To demonstrate this, in Figure 3 we plot 
the ranks of the kernels learned by the batch methods. In our 
experiments, the range of potential r values was set so that 
the batch methods never chose either the upper or lower 
bound. We did this to ensure that the range of regularization 
options was sufficiently strict or lenient. We observe that the 
batch methods generally produce low-rank kernels under a 
small number of triplets, but as more triplets are observed the 
rank increases. Our method is able to generalize better 
without using trace regularization, regardless of the preferred 
rank, due to the PA updates only satisfying triplets to the 
necessary extent. 
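
The rank of a learned kernel, as plotted for the batch methods, 
can be estimated numerically from its eigenvalues. The sketch 
below is one possible way to do so; the relative tolerance 
`rel_tol` and the function name are assumptions, since the 
threshold used for the plots is not stated. 

```python
import numpy as np


def kernel_rank(K, rel_tol=1e-8):
    """Numerical rank of a symmetric PSD kernel matrix.

    Counts eigenvalues larger than rel_tol times the largest one.
    The tolerance is a hypothetical choice, not the paper's.
    """
    eigvals = np.linalg.eigvalsh((K + K.T) / 2.0)
    top = eigvals.max()
    if top <= 0.0:
        return 0
    return int(np.sum(eigvals > rel_tol * top))


# Toy check: a Gram matrix of 2-D points has rank 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
rank_low = kernel_rank(X @ X.T)
rank_full = kernel_rank(np.eye(6))
```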


5.4 Outdoor Scene Similarity Our final experiment used 
triplets over 200 randomly chosen images of scenes from the 
Outdoor Scene Recognition (OSR) data set. Relative 
comparison queries were posed to 20 people via an online 
system. After an initial 1200 randomly chosen queries 
were answered (every object appeared as the head of a 
triplet 6 times), 20 “rounds” of 200 triplets were adaptively 
chosen according to the adaptive selection criterion in [25], 
resulting in 3600 total triplets. For each trial of this 
experiment, 1000 triplets were randomly chosen as the 
test set, 1000 as the train set, and 600 were used as the 
validation set for the r parameter. This experiment is 
especially challenging for two reasons. First, this is the 
smallest experiment in terms of triplets, highlighting how 
the methods perform with little feedback. In addition, the 
adaptive selection algorithm chooses relative comparison 
queries with the highest information gain, meaning that the 
triplets are intentionally chosen to give disparate information 
about the relationships among objects. 

Discussion: Figure 4a depicts the test errors of each 
method. We observe that STE-ERKLE consistently outperforms 
STE-Batch, and in particular STE-ERKLE performs 
well under a small number of triplets relative to all other 
methods. PA-ERKLE is comparable to or outperforms its batch 
counterpart, GNMDS-Batch, given enough triplets (at least 
500). However, PA-ERKLE performs quite well in training 
error compared to all other methods, indicating that even in 
such a challenging scenario, the passive-aggressive update 
scheme minimally interferes with previously obtained triplets. 

6 Conclusion and Future Work 

In this work, we developed a method to learn a PSD kernel 
matrix from relative comparisons given in an online fashion. 
By taking advantage of the sparse and low-rank structure 
of the online formulation, we show how to take stochastic 
gradient descent updates of complexity $O(n^2)$. We show 
how passive-aggressive online learning benefits our method 
in terms of generalizing to unseen triplets, and in conjunction 
with the stochastic gradient structure, enables us to perform 
a small number of necessary PSD projections in practice. 
Experimentally, we show on synthetic and real-world data that 
our method learns kernels that generalize to held-out relative 
comparisons as well as, and often better than, batch methods, 
while demonstrating improved run-time performance. 

Figure 4: Results from experiments on the OSR data set (10 trials) 

For future work, we wish to improve online RCKL 
in three ways. First, we will explore the use of online trace 
regularization. If trace regularization is naively applied to 
the stochastic gradient in (4.9), the update becomes full-rank 
and our efficient projection procedure cannot be used. 
However, an efficient update scheme should be possible if the 
kernel itself is low-rank. We will investigate novel methods 
for appropriately weighting the trace in an online manner, 
so that we are consistent with the parameter-free property 
of PA-ERKLE. Second, PA-ERKLE performed well in our 
experiments with moderate triplet noise; however, it could 
be beneficial to explicitly handle conflicting triplets when 
they are observed. This can be done out of model using a 
denoising method, or in model using a threshold on 
the passive-aggressive learning rate. Finally, one of the 
main benefits of having an online learning algorithm is the 
natural application of active learning methods. Prior work 
has proposed an adaptive selection scheme which operates in 
mini-batches [25]; however, such a scheme is too expensive 
to be applied online. We will investigate novel adaptive triplet 
selection methods which are both efficient and informative. 
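
The remark above that naive trace regularization makes the 
update full-rank can be checked numerically. The sketch below 
builds a triplet gradient direction that touches only the rows 
and columns of a, b, and c, then adds the gradient of a trace 
penalty, which is a multiple of the identity. The specific 
entries (1 and -1 on the diagonal, -2 and 2 off the diagonal) 
and the weight `lam` are assumptions made for illustration; the 
entries are chosen to be consistent with the factor of 10 that 
appears in Appendix A. 

```python
import numpy as np


def triplet_gradient(n, a, b, c):
    """Sparse symmetric gradient direction for triplet (a, b, c).

    Assumed form: a step of size delta along -G changes
    d_K(a, b) - d_K(a, c) by -10 * delta.
    """
    G = np.zeros((n, n))
    G[b, b] = 1.0
    G[c, c] = -1.0
    G[a, b] = G[b, a] = -2.0
    G[a, c] = G[c, a] = 2.0
    return G


n = 50
G = triplet_gradient(n, a=0, b=1, c=2)
lam = 0.1                                   # hypothetical trace weight

nonzeros = np.count_nonzero(G)              # only 6 entries are touched
rank_plain = np.linalg.matrix_rank(G)       # low-rank update
rank_regularized = np.linalg.matrix_rank(G + lam * np.eye(n))
```

In this toy setting the unregularized direction has rank 2, 
whereas adding the identity term yields a rank-n update, which 
is why the low-rank projection shortcut no longer applies. 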

A Derivation of STE Passive-Aggressive Step Size 

To derive the STE version of the passive-aggressive step size 
we wish to solve the following optimization (4.19): 

$$\min_{\delta_j} \; \delta_j^2 \quad \text{s.t.} \quad p^{K_j}_{abc} \ge P, \;\; \delta_j \ge 0$$

As with the GNMDS derivation, under the assumption that the 
triplet is not already satisfied with probability greater than or 
equal to P, only a positive value of $\delta_j$ can satisfy the first 
constraint, making the positivity constraint on $\delta_j$ redundant. 
In addition, writing the remaining constraint as 
$\log(P) - \log(1-P) + d_{K_{j-1}}(a,b) - d_{K_{j-1}}(a,c) - 10\delta_j \le 0$, 
the smallest $\delta_j$ that satisfies it is the one that makes the 
left-hand side exactly zero. As a result, the inequality 
constraint can be handled as an equality. Next, we take the 
Lagrangian: 

$$\mathcal{L}(\delta_j, \alpha) = \delta_j^2 + \alpha\left(\log(P) - \log(1-P)\right) + \alpha\left(d_{K_{j-1}}(a,b) - d_{K_{j-1}}(a,c)\right) - 10\,\delta_j\,\alpha$$

Taking the partial derivative of the Lagrangian with respect 
to $\delta_j$, setting it to 0, and solving for $\delta_j$ results in $\delta_j = 5\alpha$. 
Substituting this back into the Lagrangian makes it: 

$$-25\alpha^2 + \alpha\left(\log(P) - \log(1-P)\right) + \alpha\left(d_{K_{j-1}}(a,b) - d_{K_{j-1}}(a,c)\right)$$

Taking the partial derivative of the Lagrangian with respect 
to $\alpha$, setting it to 0, and solving for $\alpha$ results in: 

$$\alpha = \frac{\log(P) - \log(1-P) + d_{K_{j-1}}(a,b) - d_{K_{j-1}}(a,c)}{50}$$

Substituting this into the solution for $\delta_j$ gives us: 

$$\delta_j = \frac{\log(P) - \log(1-P) + d_{K_{j-1}}(a,b) - d_{K_{j-1}}(a,c)}{10}$$
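
As a sanity check on this step size, the sketch below applies it 
to a toy kernel and confirms that the updated triplet 
probability equals P. It rests on several assumptions that are 
not spelled out in this appendix: the standard STE probability 
exp(-d(a,b)) / (exp(-d(a,b)) + exp(-d(a,c))), the distance 
d_K(a, b) = K_aa + K_bb - 2 K_ab, and a gradient direction whose 
entries are chosen so that a step of size delta changes 
d_K(a, b) - d_K(a, c) by -10 delta. All function names are 
illustrative, and the PSD projection is omitted. 

```python
import numpy as np


def kernel_distance(K, i, j):
    return K[i, i] + K[j, j] - 2.0 * K[i, j]


def ste_probability(K, a, b, c):
    """Assumed STE model, rewritten as a logistic function."""
    diff = kernel_distance(K, a, b) - kernel_distance(K, a, c)
    return 1.0 / (1.0 + np.exp(diff))


def ste_pa_step_size(K, a, b, c, P):
    """Step size from the derivation; zero if already satisfied."""
    margin = np.log(P) - np.log(1.0 - P)
    diff = kernel_distance(K, a, b) - kernel_distance(K, a, c)
    return max(0.0, (margin + diff) / 10.0)


def triplet_gradient(n, a, b, c):
    """Assumed direction matching the factor of 10 above."""
    G = np.zeros((n, n))
    G[b, b], G[c, c] = 1.0, -1.0
    G[a, b] = G[b, a] = -2.0
    G[a, c] = G[c, a] = 2.0
    return G


K = np.eye(4)                    # initial kernel K_0 = I
a, b, c, P = 0, 1, 2, 0.9
delta = ste_pa_step_size(K, a, b, c, P)
K_new = K - delta * triplet_gradient(4, a, b, c)   # projection omitted
p_new = ste_probability(K_new, a, b, c)            # equals P
```

Starting from the identity kernel the triplet probability is 
0.5, so the step is aggressive. Had the probability already 
been at least P, the returned step size would be zero and the 
kernel would be left unchanged. 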


B ERKLE with Multiple Passes 


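Algorithm 2 ERKLE with Multiple Passes 

Input: $\beta$ : # of triplets stepped over 

 1: $K_0 \leftarrow I$ 
 2: for $j = 1, 2, \ldots$ do 
 3:   $K'_j \leftarrow K_{j-1} - \delta_j \nabla \ell_{t_j}(K_{j-1})$ 
 4:   $K_j \leftarrow \Pi_{\mathrm{PSD}}(K'_j)$ 
 5:   if $j > 2\beta$ then 
 6:     for $k = 1, 2, \ldots, \beta - 1$ do 
 7:       Randomly select $t'$ from $\{t_1, \ldots, t_j\}$ 
 8:       $K' \leftarrow K_j - \delta' \nabla \ell_{t'}(K_j)$ 
 9:       $K_j \leftarrow \Pi_{\mathrm{PSD}}(K')$ 
10:     end for 
11:   end if 
12: end for 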


Algorithm 2 is much like the original ERKLE algorithm. 
Here, after a sufficient number of triplets have been obtained 
(in our experiments, we chose $2\beta$), $\beta - 1$ triplets are selected 
every iteration from all previously observed triplets (for a 
total of $\beta$ updates per iteration). These triplets are stepped 
over as done in the original ERKLE algorithm. For the 
random selection used in our experiments, we simply selected 
uniformly at random with replacement from the obtained 
triplets. More sophisticated random selection procedures 
may be used in order to ensure that triplets obtained initially 
do not get selected drastically more times than those obtained 
later. For instance, when a triplet gets chosen on line 7, 
one could reduce the probability of that triplet being chosen 
subsequently. 
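
The control flow described above can be sketched as follows. 
This is a minimal illustration rather than the implementation 
used in the experiments: `update` stands for one gradient step 
followed by the PSD projection and is supplied by the caller, 
and the names `erkle_multipass`, `update`, and `seed` are 
assumptions. Replayed triplets are drawn uniformly at random 
with replacement from all triplets observed so far. 

```python
import random


def erkle_multipass(triplet_stream, update, K0, beta, seed=0):
    """Online loop that replays beta - 1 past triplets per iteration.

    update(K, t) is a caller-supplied function (hypothetical here)
    performing one step on triplet t and returning the new kernel.
    """
    rng = random.Random(seed)
    K = K0
    seen = []
    for j, t in enumerate(triplet_stream, start=1):
        seen.append(t)
        K = update(K, t)                  # step on the new triplet
        if j > 2 * beta:                  # enough triplets observed
            for _ in range(beta - 1):
                t_prime = rng.choice(seen)    # uniform, with replacement
                K = update(K, t_prime)
    return K


# Toy check with a counting "update": 10 triplets, beta = 3.
# Iterations 7 to 10 each replay 2 triplets, so 10 + 4 * 2 updates.
total_updates = erkle_multipass(range(10), lambda K, t: K + 1, K0=0, beta=3)
```

The probability-reduction idea mentioned above could be 
added by replacing the uniform draw with a weighted one, for 
example via `random.choices` with per-triplet weights that are 
decreased each time a triplet is replayed. 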